Move Parallelism usage from Apex -> Megatron Core #6393
Conversation
CodeQL found more than 10 potential problems in the proposed changes. Check the Files changed tab for more details.
LGTM. Thank you!
This PR was a massive effort. Thanks to all for their contributions and especially @aklife97 for putting it all together here.
LGTM. Thanks!
Squashed commit history (unique Signed-off-by and Co-authored-by trailers consolidated at the end):

* import parallel_state and tensor_parallel from megatron.core
* update column parallel async allreduce arg
* typos
* play stash + some changes
* make grad scaler callable
* Fixed formatting
* Make sure RETRO integrates well with the core (NVIDIA#6207)
  * fix tests
  * [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* [NLP] Support T5 with Megatron Core (NVIDIA#6222)
  * Support T5 with Megatron Core
  * Remove comment
  * Update prediction step
  * Further changes to fix fine-tuning
  * Bug fixes from runs
  * Revert changes to batch sampler, swap to pretrained sampler
  * Address feedback
* GPT P-tuning core (max_len pad -> slow)
* add GPT p-tuning w/ global batch based passing
* add T5 p-tuning support
* add megatron core install to Jenkinsfile
* fix command
* add guard default for arg
* shift bert, retro, adapter + other namespace changes
* build_model merge into one
* Ensure fine-tuning/prompt learning work for T5 (NVIDIA#6385)
* rm extra split impl
* fix for CI
* temp change for tests
* add bs=1 for log
* fix
* iter changes NMT
* NMT partial fix
* move on_train_batch_end to base_model
* rm on_train_batch_end
* temp remove NMT test
* add training_step logic for T5 derived dynamic len models
* add NMT test back
* style fix
* change no_async_tensor_model_parallel_allreduce
* sequence_parallel_enabled -> sequence_parallel
* fix T5 FT batch size
* seq enabled
* T5 sequence length fix
* NMT mp fork to spawn
* make function signatures consistent across models
* make print log
* rm unused import
* update Dockerfile to install core
* keep core path in workspace

Signed-off-by: ericharper <[email protected]>
Signed-off-by: Abhinav Khattar <[email protected]>
Signed-off-by: SeanNaren <[email protected]>
Signed-off-by: Yi Dong <[email protected]>
Signed-off-by: hsiehjackson <[email protected]>
Co-authored-by: ericharper <[email protected]>
Co-authored-by: SeanNaren <[email protected]>
Co-authored-by: Yi Dong <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
What does this PR do?
This PR moves model parallelism in NeMo to use Megatron-core instead of Apex.
We still use Apex for the microbatch calculator and some enums/utils (both to be moved to core soon), as well as LayerNorm and other kernel-backed components.
Collection: NLP
Changelog
Usage
# Add a code snippet demonstrating how to use this
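The template's snippet placeholder was left unfilled; a minimal sketch of the import change this PR describes is below. The module paths follow the PR description (`parallel_state` and `tensor_parallel` now come from `megatron.core` rather than Apex), and the `HAVE_MEGATRON_CORE` guard is an illustrative pattern, not necessarily the code NeMo ships:

```python
# Sketch of the parallelism import move described above (paths assumed
# from the PR description):
#   before: from apex.transformer import parallel_state, tensor_parallel
#   after:  from megatron.core import parallel_state, tensor_parallel
# The import is guarded so callers can degrade gracefully when
# Megatron Core is not installed in the environment.
try:
    from megatron.core import parallel_state, tensor_parallel  # noqa: F401

    HAVE_MEGATRON_CORE = True
except (ImportError, ModuleNotFoundError):
    HAVE_MEGATRON_CORE = False

print(f"megatron.core available: {HAVE_MEGATRON_CORE}")
```

Code that needs the parallel state (e.g. to query tensor-model-parallel rank) can then branch on `HAVE_MEGATRON_CORE` instead of failing at import time.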
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The contributor guidelines list specific people who can review PRs to various areas.
Additional Information